OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
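The abstract refers to "thinking-adjusted accuracy metrics" without defining them here. The sketch below illustrates one plausible shape such a metric could take, where accuracy on a simple query is discounted by thinking tokens spent beyond a budget; the function name, budget, and penalty schedule are hypothetical illustrations, not the paper's actual definition.

```python
# Hypothetical sketch of a thinking-adjusted accuracy metric: credit for a
# correct answer is discounted by "thinking" tokens spent beyond a target
# budget. Functional form and constants are illustrative only.

def thinking_adjusted_accuracy(correct, thinking_tokens, budget=100, penalty=0.001):
    """Return accuracy discounted by tokens spent beyond `budget`."""
    overage = max(0, thinking_tokens - budget)
    return (1.0 if correct else 0.0) * max(0.0, 1.0 - penalty * overage)

# A correct answer within budget keeps full credit...
print(thinking_adjusted_accuracy(True, 80))   # 1.0
# ...while a correct answer after 600 tokens of overage is discounted.
print(thinking_adjusted_accuracy(True, 700))  # 0.4
```

Under such a metric, a model that overthinks on trivial queries scores below its raw accuracy, which is what lets the benchmark penalize overthinking and underthinking jointly.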
Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data
Seiji Maekawa, Hayate Iso, Nikita Bhutani
The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel in accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning across multiple documents--what we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts.
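The abstract's contrast between max/min queries (easy for LCLMs) and aggregation queries (hard as context grows) can be made concrete with a toy example. The sketch below poses database-style queries over facts scattered across free text; the record wording and the `parse_price` helper are hypothetical illustrations, not part of the benchmark itself.

```python
# Illustrative sketch of a HoloBench-style setup: database-style queries
# answered over facts embedded in free text rather than in tables.
import re

documents = [
    "The Aurora Hotel in Oslo charges 120 dollars per night.",
    "A night at the Lumen Inn costs 95 dollars.",
    "The Basalt Lodge lists rooms at 210 dollars per night.",
]

def parse_price(doc):
    """Extract the dollar amount mentioned in one document, if any."""
    match = re.search(r"(\d+) dollars", doc)
    return int(match.group(1)) if match else None

prices = [p for p in (parse_price(d) for d in documents) if p is not None]

# A max query hinges on one decisive fact (easier for LCLMs per the abstract)...
print(max(prices))  # 210
# ...whereas an aggregation must combine every record (harder as context grows).
print(sum(prices))  # 425
```

The asymmetry the paper reports follows from this structure: a missed document rarely changes the maximum, but it always corrupts the sum.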
The Last Invention of Man - Issue 53: Monsters
The Omega Team was the soul of the company. Whereas the rest of the enterprise brought in the money to keep things going, by various commercial applications of narrow AI, the Omega Team pushed ahead in their quest for what had always been the CEO's dream: building general artificial intelligence. Most other employees viewed "the Omegas," as they affectionately called them, as a bunch of pie-in-the-sky dreamers, perpetually decades away from their goal. They happily indulged them, however, because they liked the prestige that the cutting-edge work of the Omegas gave their company, and they also appreciated the improved algorithms that the Omegas occasionally gave them. What they didn't realize was that the Omegas had carefully crafted their image to hide a secret: They were extremely close to pulling off the most audacious plan in human history. Their charismatic CEO had handpicked them not only for being brilliant researchers, but also for ambition, idealism, and a strong commitment to helping humanity. He reminded them that their plan was extremely dangerous, and that if powerful governments found out, they would do virtually anything--including kidnapping--to shut them down or, preferably, to steal their code. But they were all in, 100 percent, for much the same reason that many of the world's top physicists joined the Manhattan Project to develop nuclear weapons: They were convinced that if they didn't do it first, someone less idealistic would. The AI they had built, nicknamed Prometheus, kept getting more capable. Although its cognitive abilities still lagged far behind those of humans in many areas, for example, social skills, the Omegas had pushed hard to make it extraordinary at one particular task: programming AI systems. 
They'd deliberately chosen this strategy because they had bought the intelligence explosion argument made by the British mathematician Irving Good back in 1965: "Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man, however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control."